
Mac OS X Server Developer Release Notes:
Apple Information Access Toolkit

These release notes describe aspects of the Apple Information Access Toolkit (AIAT). AIAT is a "search engine" that implements indexing and retrieval of documents, as well as less familiar operations such as the automatic routing of similar documents and the automatic generation of document summaries. Since it is strictly an engine, AIAT has no user interface. It does support a particular style of user interaction, described in Apple Information Access Toolkit Programmer's Guide and these release notes.

Notes Specific to Developer Release

AIAT on Rhapsody

This document assumes that you have read the "Apple Information Access Toolkit Programmer's Guide" and are familiar with the various components of the IAT and the terminology used in that document. The Additions to the Apple Information Access Toolkit document provides updated reference material for that guide.

The main IAT functionality is designed to be platform-independent. However, the storage and corpus subsystems of the IAT framework still need platform-specific subclasses. These subclasses provide the functionality required to manage storage and to access documents.

Storage

IAT provides classes that facilitate the storage of blocks of data into persistent storage. This storage is used by IAT to hold the information access indexes and structures. Indexes require persistent storage; this set of logical storage classes provides an interface to the storage media desired to hold the index information. Developers may also use these storage classes to store other data they wish to make persistent.

UFS Implementation

The IAT storage architecture is designed to be platform-independent, but you should use platform-specific subclass implementations to optimize performance. IAT provides a MacOS-specific implementation of storage that uses the Macintosh HFS file system. IAT also provides a Rhapsody-specific implementation of storage that uses the Rhapsody file system.

Creating New Storage

MakeFileStorage is the utility that constructs a storage object for a Rhapsody file. You must know the path name of the storage file before you can construct Rhapsody storage; a name without a directory refers to the current directory.

Following creation, initialize the storage for use. This initialization creates the structures used to address blocks and opens the storage for writing.

Sample Code to Create Storage

#include "UFSStorage.h"
// Client must provide the storage file name:
const char* storageFileName = "storage.file"; // defaults to the current directory
// create storage
IAStorage* anIAStorage = MakeFileStorage (storageFileName);
anIAStorage->Initialize();

Opening Existing Storage

Opening existing storage requires a storage object; opening restores data from persistent storage to the object. Storage may be opened for read-only access or for read and write access. Open(true) allows writes.

Sample Code for Establishing Existing Storage

#include "UFSStorage.h"
// Client must provide the storage file name:
const char* storageFileName = "storage.file"; // defaults to the current directory
bool writable = true;
// create the storage object
IAStorage* anIAStorage = MakeFileStorage (storageFileName);
anIAStorage->Open(writable);

Allocating and Deallocating Blocks of Storage

The base unit of storage is a block. A block is a contiguous set of data that is written or read from storage as a whole. Individual bytes, words, or strings are accessed in the block once it is in memory. A block has a block ID that uniquely identifies it. This ID is of type IABlockID.

The storage object maintains a table of allocated blocks that maps each block to a specific location in physical storage. Objects using storage must know which block contains their desired data. They can do this by maintaining their own table of contents of storage, or they can request a named block in the internal storage table of contents and keep track of that block name rather than its ID. In this case, the storage maintains an internal table, known as the TOC (for "Table Of Contents"), which maps the block names to block IDs.

The following example allocates a new named block in UFS storage. When a block of storage is first created, it is always an output block, which allows data to be written to the block.

// create storage (fileName is supplied by the client)
IAStorage* anStorage = MakeFileStorage(fileName);
anStorage->Initialize();
const char* aBlockName = "MY NAMED BLOCK";
// ask for a new block to be labeled with the given name
IABlockID anIABlockID = anStorage->AllocateNamedBlock(aBlockName);
// anIABlockSize is the block size chosen by the client
IAOutputBlock anIAOutputBlock(anStorage, anIABlockID, anIABlockSize);

The example below opens an existing named block of storage for reading.

// create storage object
bool writable = true;
IAStorage* anStorage = MakeFileStorage(fileName);
anStorage->Open(writable);
// get the pre-defined block ID
const char* aBlockName = "MY NAMED BLOCK";
IABlockID anIABlockID = anStorage->TOC_Get(aBlockName);
IAInputBlock anIAInputBlock(anStorage, anIABlockID);

You can allocate storage directly, without a named block, by using the Allocate() function. This function returns a block ID, which the application must keep track of.

Delete storage by deallocating a block: use the Deallocate(anIABlockID) function for unnamed blocks, or the RemoveNamedBlock(blockName) function for named blocks.

WARNING
If you use Deallocate() to delete a named block (instead of RemoveNamedBlock()), the TOC entry for that name is left untouched. Unless you also call TOC_Remove() for that name, the name is unusable for the remaining life of the index.
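
The hazard described in this warning can be illustrated with a small model. The sketch below is not the IAT implementation: ToyStorage and all of its internals are hypothetical stand-ins, and the rule that an existing name cannot be allocated again is an assumption made for the illustration. It shows only how a name can remain in a table of contents after its block has been deallocated.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for illustration only; this is NOT the IAStorage class.
typedef std::uint32_t BlockID;

class ToyStorage {
public:
    ToyStorage() : fNextID(1) {}

    // Allocate an unnamed block; the caller must remember the returned ID.
    BlockID Allocate() {
        BlockID id = fNextID++;
        fAllocated.insert(id);
        return id;
    }

    // Allocate a block and record its name in the table of contents.
    // Assumption: a name that is already in the TOC cannot be reused.
    BlockID AllocateNamedBlock(const std::string& name) {
        if (fTOC.count(name) != 0)
            throw std::runtime_error("name already in TOC: " + name);
        BlockID id = Allocate();
        fTOC[name] = id;
        return id;
    }

    // Free the block only; a TOC entry that points at it is left behind.
    void Deallocate(BlockID id) { fAllocated.erase(id); }

    // Free the block and drop its TOC entry together.
    void RemoveNamedBlock(const std::string& name) {
        std::map<std::string, BlockID>::iterator it = fTOC.find(name);
        if (it == fTOC.end()) return;
        fAllocated.erase(it->second);
        fTOC.erase(it);
    }

    bool IsAllocated(BlockID id) const { return fAllocated.count(id) != 0; }
    bool HasName(const std::string& name) const { return fTOC.count(name) != 0; }

private:
    BlockID fNextID;                       // next ID to hand out
    std::set<BlockID> fAllocated;          // blocks currently allocated
    std::map<std::string, BlockID> fTOC;   // block name -> block ID
};

// Returns true if the name is stale after Deallocate(): it is still listed
// in the TOC although its block is gone.
bool DeallocateLeavesStaleName() {
    ToyStorage storage;
    BlockID id = storage.AllocateNamedBlock("MY NAMED BLOCK");
    storage.Deallocate(id);
    return storage.HasName("MY NAMED BLOCK") && !storage.IsAllocated(id);
}
```

In this model, RemoveNamedBlock() leaves the name free for a later allocation, whereas Deallocate() alone leaves a stale entry that blocks it.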

Creating Storage Subclasses

You may need to create a storage subclass if your persistent storage needs to be based on something other than the file systems IAT already supports (the Macintosh HFS file system and the Rhapsody file system).

The IAStorage, IAInputBlock, and IAOutputBlock classes do not require a specialized subclass. You do, however, need to subclass IAStoreStream and create a new utility to construct your storage.

Creating a Storage Construction Utility. You create storage by creating a store stream and then an IAStorage object. To construct storage, you must invoke the default construction utility IAMakeStorage(IAStoreStream*). By supplying your file type's store stream, you effectively create your file type's storage subclass. The following code shows a storage construction utility built to create Rhapsody storage.

A Utility to Construct Storage (UFSStorage.h)

#include "UFSStorage.h"
#include "UFSStoreStream.h"
IAStorage* MakeFileStorage(const char* pathname) {
     return IAMakeStorage(new UFSStoreStream(pathname));
}

Creating a Subclass of IAStoreStream. Because IAStoreStream is an abstract base class, it requires a subclass that performs the actual storage I/O for a specific platform. See the documentation of the IAStoreStream class for detailed information. The class declaration below shows the Rhapsody implementation of IAStoreStream and its functions as an example.

UFSStoreStream

class UFSStoreStream : public IAStoreStream {
public:
    UFSStoreStream(const char* name);
    ~UFSStoreStream();
    void    Initialize();
    void    Open(bool writable);
    bool    IsOpen();
    bool    IsWritable();
    bool    IsClone();
    void    Flush();
    uint32  GetEOF();
    void    SetEOF(uint32 address);
    virtual IAStoreStream* Clone();
    char*   GetFileName() const {return fFileName;}
protected:
    UFSStoreStream(const char* fileName, FILE* fd, bool isOpen,
        bool isWritable); // clone will use
    void    Write(uint32 address, const byte* data, uint32 length);
    uint32  Read(uint32 address, byte* data, uint32 length);
private:
    //...
};
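
As a rough illustration of the address-based I/O that a store stream subclass has to supply, the sketch below reads and writes byte ranges through the C stdio library. It is an assumption-laden stand-in rather than the UFSStoreStream source: SketchFileStream does not derive from IAStoreStream, its method names merely echo the declaration above, and the real class may buffer, clone, or report errors quite differently.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical stand-in; it does not derive from IAStoreStream.
class SketchFileStream {
public:
    explicit SketchFileStream(std::FILE* file) : fFile(file) {}

    // Write `length` bytes at byte offset `address`; returns true on success.
    bool Write(std::uint32_t address, const unsigned char* data, std::uint32_t length) {
        if (std::fseek(fFile, static_cast<long>(address), SEEK_SET) != 0) return false;
        return std::fwrite(data, 1, length, fFile) == length;
    }

    // Read up to `length` bytes at `address`; returns the number of bytes read.
    std::uint32_t Read(std::uint32_t address, unsigned char* data, std::uint32_t length) {
        if (std::fseek(fFile, static_cast<long>(address), SEEK_SET) != 0) return 0;
        return static_cast<std::uint32_t>(std::fread(data, 1, length, fFile));
    }

    // Current end-of-file offset, or 0 if it cannot be determined.
    std::uint32_t GetEOF() {
        if (std::fseek(fFile, 0, SEEK_END) != 0) return 0;
        long end = std::ftell(fFile);
        return end < 0 ? 0 : static_cast<std::uint32_t>(end);
    }

    void Flush() { std::fflush(fFile); }

private:
    std::FILE* fFile;   // not owned; the caller opens and closes it
};

// Round-trip two short blocks through a temporary file; returns true on success.
bool RoundTripThroughTempFile() {
    std::FILE* file = std::tmpfile();
    if (file == NULL) return false;
    SketchFileStream stream(file);
    const unsigned char first[4]  = {'I', 'A', 'T', '1'};
    const unsigned char second[4] = {'I', 'A', 'T', '2'};
    unsigned char in[4] = {0, 0, 0, 0};
    bool ok = stream.Write(0, first, 4) && stream.Write(4, second, 4);
    stream.Flush();
    ok = ok && stream.Read(4, in, 4) == 4;          // read the second block back
    ok = ok && std::memcmp(in, second, 4) == 0;
    ok = ok && stream.GetEOF() == 8;                // two 4-byte blocks were written
    std::fclose(file);
    return ok;
}
```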

Corpus Overview

In the field of information retrieval a "corpus" is a collection of documents that is being searched. In IAT a corpus class provides the tools for identifying a set of documents as a collection and providing text from these documents so they can be indexed. The corpus is the interface between documents and the index. The corpus locates the document files and provides buffered text from these documents to the index and analysis objects. The corpus maintains the location of the collection of documents and, optionally, provides an iterator through them.

IAT provides an implementation that supports Rhapsody files and interfaces to the collection of files within a Rhapsody directory. There are two implementations of the corpus abstract classes. UFSCorpus provides access to the content in a Rhapsody file. UFSDirectoryCorpus provides, in addition, the ability to iterate through a directory and its subdirectories and select documents.

Corpus: Common Procedures

Using a Corpus to Provide Documents

Use the corpus document iterator to provide all documents currently in the corpus, whether or not they are indexed. The following code example illustrates how to list all documents in a Rhapsody directory.

// build the corpus
UFSDirectoryCorpus aDirectoryCorpus(directoryPathName);
// get an iterator through the corpus
IADocIterator* anIADocIterator = aDirectoryCorpus.GetDocIterator();
UFSDirectoryDoc* directoryDoc;
while ((directoryDoc = (UFSDirectoryDoc*)anIADocIterator->GetNextDoc()) != NULL) {
    // NULL when no more text docs in folder
    printf("");
    PrintDocName(directoryDoc);
    printf("");
}

Creating a New Corpus

A corpus is stored through its index. Generally a corpus is created at the same time an index is created. See the documentation on creating an index.

// choose a corpus implementation
#include "UFSDirectoryCorpus.h"
// choose an analysis implementation
#include "SimpleAnalysis.h"
// choose an index implementation
#include "InVecIndex.h"
// get the user information (using constants for the sake of this example)
const char* name = "recipes.index";
const char* UFSDirectoryName = "/myroot/Corpora/recipes";
// ... do your storage stuff here (this must produce aStorage)
// create index for directory (creates corpus and analysis)
InVecIndex anInVecIndex(aStorage, new UFSDirectoryCorpus(UFSDirectoryName),
    new SimpleAnalysis());

Establishing an Existing Corpus

The corpus is stored through its index. To establish an existing corpus, you must first establish its index and then address the corpus data member. The corpus is stored in the index as an IACorpus.

// establish the existing index containing the corpus
// get the user information (using constants for the sake of this example)
const char* storageName = "recipes.index";
const char* UFSDirectoryName = "/myroot/Corpora/recipes";
bool writable = true;
// reestablish storage for the index
IAStorage* aStorage = MakeFileStorage(storageName);
IADeleteOnUnwind delInxStorage(aStorage);
aStorage->Open(writable);
// reestablish index for folder (reestablishes corpus and analysis)
InVecIndex anInVecIndex(aStorage, new UFSDirectoryCorpus(UFSDirectoryName),
    new SimpleAnalysis());
anInVecIndex.Open();

Creating Corpus Subclasses

If you need to create a corpus subclass, you generally need to create several subclasses:

One of IACorpus, to characterize the set of documents.
One of IADoc, to provide information that uniquely identifies and locates a single document.
One of IADocText, to obtain a text string from the document.

You may also need to provide a subclass of IADocIterator if you wish to provide an index Update() function.

UFSCorpus

The UFSCorpus class characterizes a set of documents. It contains information identifying the directory parentage of the documents. You can use a UFSCorpus object to extract text from a Rhapsody file. This class is defined in the header file UFSCorpus.h.

class UFSCorpus : public IACorpus {
public:
    UFSCorpus() : IACorpus(UFSCorpusType) {}
    virtual ~UFSCorpus() {}
    // IACorpus methods
    IADoc*     GetProtoDoc();
    IADocText* GetDocText(const IADoc* doc);
private:
    // ...
};

UFSDoc

UFSDoc contains the information needed to locate a document: its full path name. IADoc is the abstract class for the interface to the physical document. Any implementation must contain the data required to locate the actual document. An implementation of an IADoc subclass requires a matching implementation of an IADocText subclass. The UFSDoc class is defined in the header file UFSCorpus.h.

class UFSDoc : public IADoc {
public:
    UFSDoc() : fPathName(NULL) {}
    UFSDoc(const byte* p, bool makeCopy = true);
    virtual      ~UFSDoc();
    IAStorable*  DeepCopy() const;
    uint32       StoreSize() const;
    void         Store(IAOutputBlock* output) const;
    IAStorable*  Restore(IAInputBlock* input) const;
    bool         LessThan(const IAOrderedStorable* neighbor) const;
    bool         Equal(const IAOrderedStorable* neighbor) const;
    byte*        GetName(uint32 *length) const;
    void         SetName(const byte* fullpath);
    byte*        GetPath() const {return fPathName;}
protected:
    void         DeepCopying(const IAStorable* source);
    void         Restoring(IAInputBlock* input, const IAStorable* proto);
private:
    // ...
};

UFSDocText

IADocText provides the text from the actual document. An implementation of this must be able to locate the document and read its contents. This class is defined in the header file UFSCorpus.h.

class UFSDocText : public IADocText {
public:
    UFSDocText() : fStream(NULL) {}
    UFSDocText(const byte* path);
    ~UFSDocText();
    uint32     GetNextBuffer(byte* buffer, uint32 bufferLength);
    IADocText* DeepCopy() const;
private:
    // ...
};
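
The buffered-text pattern behind GetNextBuffer() can be sketched as follows. This is a minimal illustration under stated assumptions, not IAT code: SketchDocText is a hypothetical in-memory source that does not derive from IADocText, and the convention that a return value of 0 marks the end of the document is assumed here rather than taken from the toolkit's documentation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

typedef unsigned char byte;   // stand-in for the toolkit's byte type (assumed)

// Hypothetical in-memory text source; NOT a subclass of IADocText.
class SketchDocText {
public:
    explicit SketchDocText(const std::string& text) : fText(text), fPos(0) {}

    // Copy up to bufferLength bytes into buffer and return the count copied.
    // Assumption: a return value of 0 signals the end of the document.
    std::uint32_t GetNextBuffer(byte* buffer, std::uint32_t bufferLength) {
        std::size_t remaining = fText.size() - fPos;
        std::size_t n = remaining < bufferLength ? remaining : bufferLength;
        std::memcpy(buffer, fText.data() + fPos, n);
        fPos += n;
        return static_cast<std::uint32_t>(n);
    }

private:
    std::string fText;   // the whole document text
    std::size_t fPos;    // offset of the next unread byte
};

// Drain a text source through a small fixed buffer, as an indexer might.
std::string ReadAll(SketchDocText& docText) {
    byte buffer[8];
    std::string result;
    std::uint32_t n;
    while ((n = docText.GetNextBuffer(buffer, sizeof buffer)) > 0)
        result.append(reinterpret_cast<const char*>(buffer), n);
    return result;
}
```

The caller never needs to know the document's total length; it simply keeps asking for the next buffer until none is left.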

Creating a Subclass of IADocIterator

The IADocIterator locates the documents in the corpus in sequence. Here is a sample header file for an IADocIterator subclass:

class UFSDirectoryCorpusIterator : public IADocIterator {
public:
       UFSDirectoryCorpusIterator(UFSDirectoryCorpus* c)
             : corpus(c), ufsIterator(new UFSIterator(c->GetFullPath())) {}
       ~UFSDirectoryCorpusIterator() {delete ufsIterator; }
       IADoc*              GetNextDoc();
private:
       UFSDirectoryCorpus* corpus;
       UFSIterator*        ufsIterator;
};

And here is a sample implementation for GetNextDoc():

IADoc* UFSDirectoryCorpusIterator::GetNextDoc() {
    while (ufsIterator->Increment()) {
        struct stat* info = ufsIterator->GetFileInfo();
        return new UFSDirectoryDoc(corpus,
            ufsIterator->GetPath(),
            ufsIterator->GetFileName(),
            info->st_mtime);
    }
    return NULL;
}

UFSIterator

UFSIterator returns every file below a given directory, recursing through all subdirectories to reach the actual files. The Increment() member function returns true if a file has been found and false if there are no more files within the directories. You can obtain file information (stat) with the GetFileInfo() function, and the file's directory path and name with the GetPath() and GetFileName() functions. This class is defined in the header file UFSIterator.h.

class UFSIterator : public IAObject {
public:
    UFSIterator(const byte* pathname);
    ~UFSIterator();
    bool              Increment();
    struct stat*      GetFileInfo() const {return fStat;}
    byte*             GetPath() const;
    byte*             GetFileName() const;
protected:
    // Accessors needed to override Increment()
    DirectoryInfo*    GetDirInfos() const {return fDirInfos;}
    long              GetDirCount() const {return fDirCount;}
    uint32            GetDir() const {return fDir;}
    void              CollectDirInfo(const byte* name);
private:
    // ...
};

Example Using UFSIterator

while (ufsIterator->Increment()) {
    struct stat* info = ufsIterator->GetFileInfo();
    byte* name = ufsIterator->GetFileName();
    // filter out invalid and old documents;
    // the definition of ValidType() is up to you, and today is assumed
    // to hold the time at which the current day began
    if (info->st_mtime >= today && ValidType(name)) {
        return new UFSDirectoryDoc(corpus,
            ufsIterator->GetPath(),
            ufsIterator->GetFileName(),
            info->st_mtime);
    }
}
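
For readers who want to see how an Increment()-style iterator over a directory tree might be built, here is a minimal sketch that uses the POSIX opendir(), readdir(), and lstat() calls. It is not the UFSIterator source: SketchFileIterator, its members, and the CountFiles() helper are hypothetical, the visiting order is unspecified, and symbolic links are deliberately skipped to avoid cycles.

```cpp
#include <cassert>
#include <cstring>
#include <dirent.h>
#include <string>
#include <sys/stat.h>
#include <vector>

// Hypothetical stand-in for illustration; the real UFSIterator may differ.
class SketchFileIterator {
public:
    explicit SketchFileIterator(const std::string& root) {
        std::memset(&fStat, 0, sizeof fStat);
        CollectDir(root);
    }

    // Advance to the next regular file; returns false when none remain.
    bool Increment() {
        while (!fPending.empty()) {
            std::string candidate = fPending.back();
            fPending.pop_back();
            struct stat info;
            if (lstat(candidate.c_str(), &info) != 0) continue;   // skip unreadable entries
            if (S_ISDIR(info.st_mode)) { CollectDir(candidate); continue; }
            if (S_ISREG(info.st_mode)) {
                fPath = candidate;
                fStat = info;
                return true;
            }
        }
        return false;
    }

    const struct stat* GetFileInfo() const { return &fStat; }
    const std::string& GetPath() const { return fPath; }

private:
    // Queue every entry of `dir` except "." and "..".
    void CollectDir(const std::string& dir) {
        DIR* d = opendir(dir.c_str());
        if (d == NULL) return;
        while (struct dirent* entry = readdir(d)) {
            if (std::strcmp(entry->d_name, ".") == 0 ||
                std::strcmp(entry->d_name, "..") == 0)
                continue;
            fPending.push_back(dir + "/" + entry->d_name);
        }
        closedir(d);
    }

    std::vector<std::string> fPending;   // files and directories still to visit
    std::string fPath;                   // path of the current file
    struct stat fStat;                   // lstat() result for the current file
};

// Count the regular files below `root`, descending into subdirectories.
int CountFiles(const std::string& root) {
    SketchFileIterator it(root);
    int count = 0;
    while (it.Increment()) ++count;
    return count;
}
```

A filter such as the modification-date test in the example above would sit inside the while loop in CountFiles(), using GetFileInfo()->st_mtime.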

UFSDirectoryCorpus

The UFSDirectoryCorpus is a subclass of the IACorpus class. It maintains an iterator that, given a directory, returns documents within that directory and, recursively, in subdirectories. It chooses only documents that satisfy the client-defined criteria. Because UFSDirectoryDoc contains a modification date, only those selected documents (files) modified since the last update are submitted for re-analysis.

The client registers a function (using the SetCriteriaFunction member function) that returns true if the current document satisfies the client-defined criteria (for example, if it is the right type) and false otherwise. By default all documents are selected. This class is defined in the header file UFSDirectoryCorpus.h.

typedef bool DocumentTypeFn(const char* fileName); // criteria function type
class UFSDirectoryCorpus : public IACorpus {
public:
    UFSDirectoryCorpus(uint32 type = UFSDirectoryCorpusType);
    UFSDirectoryCorpus(const byte* rootDirPath, uint32 type = UFSDirectoryCorpusType);
    virtual ~UFSDirectoryCorpus();
    // IACorpus methods
    IADoc*         GetProtoDoc(); // this will return UFSDirectoryDoc
    IADocText*     GetDocText(const IADoc* doc); // this will return a UFSDocText
    IADocIterator* GetDocIterator();
    // UFSDirectoryCorpus specific methods
    uint32         GetDirectoryID(const byte* fullPath); // allocate id for path.
    // returns an IAArray of path
    byte*          GetDirectory(uint32 directoryID, uint32 *length);
    byte*          GetFullPath() const {return fRootDirectory;}
    void           SetCriteriaFunction(DocumentTypeFn* func);
    DocumentTypeFn* GetCriteriaFunction() const;
protected:
    IABlockSize    InitialSize();
    void           Initializing(IAOutputBlock* output);
    void           Opening(IAInputBlock* input);
    IABlockSize    UpdateSize();
    void           Updating(IAOutputBlock* output);
    UFSDirectoryInfo** GetDirectoryInfos() const {return fDirectoryInfos;}
    uint32         GetDirectoryCount() const {return fDirectoryCount;}
    void           DeleteDirectoryInfos();
private:
    // ...
};
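
A criteria function is an ordinary function that matches the DocumentTypeFn typedef. The sketch below shows one possible criterion and the register-then-filter pattern. IsTextFile() and SketchSelector are hypothetical names introduced for illustration; SketchSelector is not the UFSDirectoryCorpus class and models only the behavior described above, including selecting every document when no function is registered.

```cpp
#include <cassert>
#include <cstring>

typedef bool DocumentTypeFn(const char* fileName);   // criteria function type

// Example criterion (hypothetical): accept only names ending in ".txt".
bool IsTextFile(const char* fileName) {
    const char* suffix = ".txt";
    std::size_t nameLen = std::strlen(fileName);
    std::size_t suffixLen = std::strlen(suffix);
    return nameLen >= suffixLen &&
           std::strcmp(fileName + nameLen - suffixLen, suffix) == 0;
}

// Hypothetical holder that mimics the register-then-filter pattern;
// it is NOT the UFSDirectoryCorpus class.
class SketchSelector {
public:
    SketchSelector() : fCriteria(NULL) {}
    void SetCriteriaFunction(DocumentTypeFn* func) { fCriteria = func; }
    DocumentTypeFn* GetCriteriaFunction() const { return fCriteria; }

    // With no criterion registered, every document is selected.
    bool Selects(const char* fileName) const {
        return fCriteria == NULL || fCriteria(fileName);
    }

private:
    DocumentTypeFn* fCriteria;   // registered criterion, or NULL for "select all"
};
```

With the real class, the equivalent registration would presumably be a single call such as aDirectoryCorpus.SetCriteriaFunction(IsTextFile) before the corpus is iterated.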

UFSDirectoryDoc

Recall that IADoc is the abstract class for the interface to the physical document. Any implementation of a concrete subclass must provide the data required to locate the actual document. Any implementation of an IADoc subclass requires a matching implementation of an IADocText subclass. However, the GetDocText() member function of UFSDirectoryCorpus returns a UFSDocText, so no separate UFSDirectoryDocText class is defined. The UFSDirectoryDoc class is defined in UFSDirectoryCorpus.h.

class UFSDirectoryDoc : public IADoc {
public:
    UFSDirectoryDoc(UFSDirectoryCorpus* corpus,
        const byte* path, const byte* file, long date);
    UFSDirectoryDoc();
    virtual ~UFSDirectoryDoc();
    IAStorable* DeepCopy() const;
    uint32      StoreSize() const;
    void        Store(IAOutputBlock* output) const;
    IAStorable* Restore(IAInputBlock* input) const;
    bool        LessThan(const IAOrderedStorable* neighbor) const;
    bool        Equal(const IAOrderedStorable* neighbor) const;
    byte*       GetName(uint32 *length) const;
    uint32      GetDirectoryID() const;
    void        SetModDate(long mDate) {fModDate = mDate;}
    uint32      GetModDate() const {return fModDate;}
    uint32      GetDirID() const {return fDirID;}
    byte*       GetFileName() const {return fName;}
protected:
    void        DeepCopying(const IAStorable* source);
    void        Restoring(IAInputBlock* input, const IAStorable* proto);
private:
    // ...
};

UFSDirectoryInfo

An object of the UFSDirectoryInfo class helps to reduce the size of corpus-related information in the index. It maps a directory ID (IAT generated) to parentage path names and the creation date. This mapping eliminates the need for storing full path names for each document in the same directory in the index. UFSDirectoryCorpus uses a collection of UFSDirectoryInfo instances for looking up directory IDs. This class is defined in UFSDirectoryCorpus.h.

class UFSDirectoryInfo : public IAStorable {
public:
    UFSDirectoryInfo();
    UFSDirectoryInfo(const byte* pathname);
    ~UFSDirectoryInfo();
    // methods to store a UFSDirectoryInfo
    IABlockSize    StoreSize() const;
    void           Store(IAOutputBlock* output) const;
    IAStorable*    Restore(IAInputBlock* input) const;
    IAStorable*    DeepCopy() const;
    byte*          GetDirectoryName() const {return fPath;}
    uint32         GetCreationDate() const {return fCreationDate;}
    void           SetDirectoryName(const byte* pathname) {fPath = (byte*)pathname;}
    void           SetCreationDate(uint32 cDate) {fCreationDate = cDate;}
private:
    // ...
};
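
The space saving described for UFSDirectoryInfo comes from storing each directory path once and referring to it by a small ID. The sketch below illustrates that idea only. SketchDirectoryTable, SketchDoc, and FullPath() are hypothetical names, creation dates are omitted, and nothing here reflects how IAT actually lays out its index.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for illustration; NOT the UFSDirectoryCorpus implementation.
class SketchDirectoryTable {
public:
    // Return the ID for `path`, assigning the next free ID on first sight.
    std::uint32_t GetDirectoryID(const std::string& path) {
        std::map<std::string, std::uint32_t>::const_iterator it = fIDs.find(path);
        if (it != fIDs.end()) return it->second;
        std::uint32_t id = static_cast<std::uint32_t>(fPaths.size());
        fPaths.push_back(path);
        fIDs[path] = id;
        return id;
    }

    // Look the path back up; the ID must have come from GetDirectoryID().
    const std::string& GetDirectory(std::uint32_t id) const { return fPaths.at(id); }

private:
    std::vector<std::string> fPaths;              // ID -> path
    std::map<std::string, std::uint32_t> fIDs;    // path -> ID
};

// A document record then needs only a small ID plus its file name.
struct SketchDoc {
    std::uint32_t dirID;
    std::string fileName;
};

// Rebuild the full path of a document from the shared directory table.
std::string FullPath(const SketchDirectoryTable& table, const SketchDoc& doc) {
    return table.GetDirectory(doc.dirID) + "/" + doc.fileName;
}
```

However many documents share a directory, its path is stored once in the table, and each document record carries only the numeric ID.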

Exceptions

UFSDirectoryNotFound - Specified directory is not found
UFSError - File System Error

Document Routing

UFSCluster

A category for related documents is called a cluster. Clusters are represented by the IACluster class, which must be subclassed to handle particular document types. For example, the IAT provides the subclass UFSCluster, which represents a cluster of Rhapsody documents.

class UFSCluster : public IACluster {
public:
    UFSCluster(IAIndex* index, const byte* path);
    virtual    ~UFSCluster();
    IADoc*     GetNextDoc() const; // returns the next document in the cluster.
    void       Reset(); // reset to the first document
};
